---
id: getting-started-rag
title: RAG Evaluation
sidebar_label: RAG
---

import { Timeline, TimelineItem } from "@site/src/components/Timeline";
import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";
import CodeBlock from "@theme/CodeBlock";
import VideoDisplayer from "@site/src/components/VideoDisplayer";

Learn to evaluate retrieval-augmented-generation (RAG) pipelines and systems using `deepeval`, such as RAG QA, summarizaters, and customer support chatbots.

## Overview

RAG evaluation involves evaluating the retriever and generator as separately components. This is because in a RAG pipeline, the final output is only as good as the context you've fed into your LLM.

**In this 5 min quickstart, you'll learn how to:**

- Evaluate your RAG pipeline end-to-end
- Test the retriever and generator as separate components
- Evaluate multi-turn RAG

## Prerequisites

- Install `deepeval`
- A Confident AI API key (recommended). Sign up for one [here.](https://app.confident-ai.com)

:::info
Confident AI allows you to view and share your testing reports. Set your API key in the CLI:

```bash
CONFIDENT_API_KEY="confident_us..."
```

:::

## Run Your First RAG Eval

End-to-end RAG evaluation treats your entire LLM app as a standalone RAG pipeline. In `deepeval`, a single-turn interaction with your RAG pipeline is modelled as an LLM test case:

![LLM Test Case](https://deepeval-docs.s3.amazonaws.com/docs:llm-test-case.png)

The `retrieval_context` in the diagram above is cruical, as it represents the text chunks that were retrieved at evaluation time.

:::note

`deepeval` provides a wide selection of LLM models that you can easily choose from and run evaluations with.

<Tabs>

<TabItem value="openai" label="OpenAI">

```python
from deepeval.metrics import AnswerRelevancyMetric

task_completion_metric = AnswerRelevancyMetric(model="gpt-4.1")
```

</TabItem>

<TabItem value="anthropic" label="Anthropic">

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import AnthropicModel

model = AnthropicModel("claude-3-7-sonnet-latest")
task_completion_metric = AnswerRelevancyMetric(model=model)
```

</TabItem>

<TabItem value="gemini" label="Gemini">

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import GeminiModel

model = GeminiModel("gemini-2.5-flash")
task_completion_metric = AnswerRelevancyMetric(model=model)
```

</TabItem>

<TabItem value="azure-openai" label="Ollama">

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import OllamaModel

model = OllamaModel("deepseek-r1")
task_completion_metric = AnswerRelevancyMetric(model=model)
```

</TabItem>

<TabItem value="grok" label="Grok">

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import GrokModel

model = GrokModel("grok-4-0709")
task_completion_metric = AnswerRelevancyMetric(model=model)
```

</TabItem>

<TabItem value="azure" label="Azure OpenAI">

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import AzureOpenAIModel

model = AzureOpenAIModel(
    model_name="gpt-4.1",
    deployment_name="Test Deployment",
    azure_openai_api_key="Your Azure OpenAI API Key",
    openai_api_version="2025-01-01-preview",
    azure_endpoint="https://example-resource.azure.openai.com/",
    temperature=0
)
task_completion_metric = AnswerRelevancyMetric(model=model)
```

</TabItem>

<TabItem value="amazon-bedrock" label="Amazon Bedrock">

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import AmazonBedrockModel

model = AmazonBedrockModel(
    model_id="anthropic.claude-3-opus-20240229-v1:0",
    temperature=0
)
task_completion_metric = AnswerRelevancyMetric(model=model)
```

</TabItem>

<TabItem value="vertex-ai" label="Vertex AI">

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import GeminiModel

model = GeminiModel(
    model_name="gemini-1.5-pro",
    project="Your Project ID",
    location="us-central1",
    temperature=0
)
task_completion_metric = AnswerRelevancyMetric(model=model)
```

</TabItem>

</Tabs>
:::

<Timeline>
<TimelineItem title="Setup RAG pipeline">

Modify your RAG pipeline to return the retrieved contexts alongside the
LLM response.

<Tabs>
<TabItem value="python" label="Python">

```python title=main.py showLineNumbers={true}
def rag_pipeline(input):
   ...
   return 'RAG output', ['retrieved context 1', 'retrieved context 2', ...]
```

</TabItem>
<TabItem value="langgraph" label="LangGraph">

```python title="main.py" showLineNumbers={true}
from langchain_core.messages import HumanMessage
from langchain.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local("./faiss_index", embeddings)
retriever = vectorstore.as_retriever()
llm = ChatOpenAI(model="gpt-4")

def rag_pipeline(input):
    # Extract retrieval context
    retrieved_docs = retriever.get_relevant_documents(input)
    context_texts = [doc.page_content for doc in retrieved_docs]

    # Generate response
    state = {"messages": [HumanMessage(content=input + "\\n\\n".join(context_texts))]}
    result = llm.invoke(state)
    return result["messages"][-1].content, context_texts
```

</TabItem>
<TabItem value="langchain" label="LangChain">

```python title="main.py" showLineNumbers={true}
from langchain_openai import ChatOpenAI
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-4")
vectorstore = Chroma(persist_directory="./chroma_db")
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

def rag_pipeline(input):
    # Extract retrieval context
    retrieved_docs = retriever.get_relevant_documents(input)
    context_texts = [doc.page_content for doc in retrieved_docs]

    # Generate response
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True
    )
    result = qa_chain.invoke({"query": input})
    return result["result"], context_texts
```

</TabItem>
<TabItem value="llama_index" label="LlamaIndex">

```python title="main.py" showLineNumbers={true}
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

def rag_pipeline(input):
    # Generate response
    response = query_engine.query(input)

    # Extract retrieval context
    context_texts = []
    if hasattr(response, 'source_nodes'):
        context_texts = [node.text for node in response.source_nodes]
    return str(response), context_texts
```

</TabItem>
</Tabs>

:::info
Instead of changing your code to return these data, we'll show a better way to run RAG evals in the next section.
:::

</TimelineItem>
<TimelineItem title="Create a test case">

Create a test case using retrieval context and LLM output from your RAG pipeline. Optionally provide an expected output if you plan to use [contextual precision](/docs/metrics-contextual-precision) and [contextual recall](/docs/metrics-contextual-recall) metrics.

```python title=main.py {1,4}
from deepeval.test_case import LLMTestCase

input = 'How do I purchase tickets to a Coldplay concert?'
actual_output, retrieved_contexts = rag_pipeline(input)

test_case = LLMTestCase(
    input=input,
    actual_output=actual_output,
    retrieval_context=retrieved_contexts,
    expected_output='optional expected output'
)
```

</TimelineItem>
<TimelineItem title="Define metrics">

Define RAG metrics to evaluate your RAG pipeline, or define your own using [G-Eval](/docs/metrics-llm-evals).

```python
from deepeval.metrics import AnswerRelevancyMetric, ContextualPrecisionMetric

answer_relevancy = AnswerRelevancyMetric(threshold=0.8)
contextual_precision = ContextualPrecisionMetric(threshold=0.8)
```

<details>
<summary>What RAG metrics are available?</summary>

DeepEval offers a total of 5 RAG metrics, which are:

- [Answer Relevancy](/docs/metrics-answer-relevancy)
- [Faithfulness](/docs/metrics-faithfulness)
- [Contextual Relevancy](/docs/metrics-contextual-relevancy)
- [Contextual Precision](/docs/metrics-contextual-precision)
- [Contextual Recall](/docs/metrics-contextual-recall)

Each metric measures a [different parameter](/guides/guides-rag-evaluation) in your RAG pipeline's quality, and each can help you determine the best prompts, models, or retriever settings for your use-case.

</details>

</TimelineItem>
<TimelineItem title="Run an evaluation">

Run an evaluation on the LLM test case you previously created using the metrics defined above.

```python title="main.py" showLineNumbers={true}
from deepeval import evaluate
...

evaluate([test_case], metrics=[answer_relevancy, contextual_precision])
```

🎉🥳 **Congratulations!** You've just ran your first RAG evaluation. Here's what happened:

- When you call `evaluate()`, `deepeval` runs all your `metrics` against all `test_cases`
- All `metrics` outputs a score between `0-1`, with a `threshold` defaulted to `0.5`
- Metrics like `contextual_precision` evaluates based on the `retrieval_context`, whereas `answer_relevancy` checks the `actual_output` of your test case
- A test case passes only if all metrics passess

This creates a test run, which is a "snapshot"/benchmark of your RAG pipeline at any point in time.

</TimelineItem>

<TimelineItem title="Viewing on Confident AI (recommended)">

If you've set your `CONFIDENT_API_KEY`, test runs will appear automatically on [Confident AI](https://app.confident-ai.com), the DeepEval platform.

<VideoDisplayer src="https://deepeval-docs.s3.us-east-1.amazonaws.com/getting-started%3Arag.mp4" />

:::tip

If you haven't logged in, you can still upload the test run to Confident AI from local cache:

```bash
deepeval view
```

:::

</TimelineItem>
</Timeline>

## Evaluate Retriever

`deepeval` allows you to evaluate RAG components individually. This also means you don't have to return `retrieval_context`s in awkward places just to feed data into the `evaluate()` function.

<Timeline>
<TimelineItem title="Trace your retriever">

Attach the `@observe` decorator to functions/methods that make up your retriever. These will represent individual components in your RAG pipeline.

```python title=main.py showLineNumbers={true}  {3,6,10}
from deepeval.tracing import observe

@observe()
def retriever(input):
    # Your retriever implemetation goes here
    pass
```

:::info important
Set the `CONFIDENT_TRACE_FLUSH=YES` in your CLI to prevent traces from being lost in case of an early program termination.

```bash
export CONFIDENT_TRACE_FLUSH=YES
```

:::

</TimelineItem>
<TimelineItem title="Define metrics & test cases">

Create a retriever focused metric. You'll then need to:

1. Add it to your component
2. Create an `LLMTestCase` in that component with `retrieval_context`

```python title=main.py showLineNumbers={true} {6,10}
from deepeval.tracing import observe, update_current_span
from deepeval.metrics import ContextualRelevancyMetric

contextual_relevancy = ContextualRelevancyMetric(threshold=0.6)

@observe(metrics=[contextual_relevancy])
def retriever(query):
    # Your retriever implemetation goes here
    update_current_span(
        test_case=LLMTestCase(input=query, retrieval_context=["..."])
    )
    pass
```

</TimelineItem>

<TimelineItem title="Run an evaluation">

Finally, use the `dataset` iterator to invoke your RAG system on a list of goldens.

```python title=main.py showLineNumbers={true} {5,8}
from deepeval.dataset import EvaluationDataset, Golden
...

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input='This is a test query')])

# Loop through dataset
for golden in dataset.evals_iterator():
    retriever(golden.input)
```

✅ Done. With this setup, a simple for loop is all that's required.

:::tip
You can also evaluate your retriever if it is nested within a RAG pipeline:

```python showLineNumbers {14}
from deepeval.dataset import EvaluationDataset, Golden
...

def rag_pipeline(query):
    @observe(metrics=[contextual_relevancy])
    def retriever(query):
        pass

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input='This is a test query')])

# Loop through dataset
for golden in dataset.evals_iterator():
    rag_pipeline(golden.input)
```

:::

</TimelineItem>

</Timeline>

## Evaluate Generator

The same applies to evaluating the generator of your RAG pipeline, only this time you would trace your generator with metrics focused on your generator instead.

<Timeline>
<TimelineItem title="Trace your generator">

Attach the `@observe` decorator to functions/methods that make up your generator:

```python title=main.py showLineNumbers={true}  {3,6,10}
from deepeval.tracing import observe

@observe()
def generator(query):
    # Your retriever implemetation goes here
    pass
```

</TimelineItem>
<TimelineItem title="Define metrics & test cases">

Create a generator focused metric. You'll then need to:

1. Add it to your component
2. Create an `LLMTestCase` with the required parameters

For example, the `FaithfulnessMetric` requires `retrieval_context`, while `AnswerRelevancyMetric` doesn't.

```python title=main.py showLineNumbers={true} {6,9}
from deepeval.tracing import observe, update_current_span
from deepeval.metrics import AnswerRelevancyMetric

answer_relevancy = AnswerRelevancyMetric(threshold=0.6)

@observe(metrics=[answer_relevancy])
def generator(query, text_chunks):
    # Your retriever implemetation goes here
    update_current_span(test_case=LLMTestCase(input=query, actual_output="..."))
    pass
```

</TimelineItem>

<TimelineItem title="Run an evaluation">

Finally, use the `dataset` iterator to invoke your RAG system on a list of goldens.

```python title=main.py showLineNumbers={true} {5,8}
from deepeval.dataset import EvaluationDataset, Golden
...

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input='This is a test query')])

# Loop through dataset
for golden in dataset.evals_iterator():
    generator(golden.input)
```

✅ Done. You just learnt how to evaluate the generator as a standalone.

:::info
You can also combine retriever and generator evals:

```python showLineNumbers {7,11,21}
from deepeval.dataset import EvaluationDataset, Golden
...

def rag_pipeline(query):
    @observe(metrics=[contextual_relevancy])
    def retriever(query) -> list[str]:
        update_current_span(test_case=LLMTestCase(input=query, retrieval_context=["..."]))

    @observe(metrics=[answer_relevancy])
    def generator(query, text_chunks):
        update_current_span(test_case=LLMTestCase(input=query, actual_output="..."))

    text_chunks = retriever(query)
    return generator(query, text_chunks)

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input='This is a test query')])

# Loop through dataset
for golden in dataset.evals_iterator():
    rag_pipeline(golden.input)
```

<VideoDisplayer src="https://deepeval-docs.s3.us-east-1.amazonaws.com/getting-started%3Arag-evals%3Acomponent.mp4" />

:::

</TimelineItem>

</Timeline>

## Multi-Turn RAG Evals

`deepeval` also lets you evaluate RAG in multi-turn systems. This is especially useful for chatbots that rely on RAG to generate responses, such as customer support chatbots.

:::note
You should first read [this section](/docs/getting-started-chatbots) on multi-turn evals if you haven't already.
:::

<Timeline>

<TimelineItem title="Create a test case">

Create a `ConversationalTestCase` by passing in a list of `Turn`s from an existing conversation, similar to OpenAI's message format.

```python title=main.py showLineNumbers={true} {1,9,15}
from deepeval.test_case import ConversationalTestCase, Turn

test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="I'd like to buy a ticket to a Coldplay concert."),
        Turn(
            role="assistant",
            content="Great! I can help you with that. Which city would you like to attend?",
            retrieval_context=["Concert cities: New York, Los Angeles, Chicago"]
        ),
        Turn(role="user", content="New York, please."),
        Turn(
            role="assistant",
            content="Perfect! I found VIP and standard tickets for the Coldplay concert in New York. Which one would you like?",
            retrieval_context=["VIP ticket details", "Standard ticket details"]
        )
    ]
)
```

Since your chatbot uses RAG, each turn from the assistant should also include the `retrieval_context` parameter.

</TimelineItem>
<TimelineItem title="Create a metric">

Define a custom conversational RAG metric to evaluate your chatbot system using [Conversational G-Eval](/docs/metrics-conversational-g-eval).

```python
from deepeval.metrics import ConversationalGEval
from deepeval.test_case import TurnParams

turn_faithfulness = ConversationalGEval(
    name="Faithfulness",
    criteria="Determine whether the assistant's responses are factually supported by the retrieved context across the entire conversation.",
    evaluation_params=[TurnParams.ROLE, TurnParams.CONTENT, TurnParams.RETRIEVAL_CONTEXT],
)
```

</TimelineItem>
<TimelineItem title="Run an evaluation">

Run an evaluation on the test case using the `evaluate` function and the conversational RAG metric you've defined.

```python title="main.py" showLineNumbers={true}
from deepeval import evaluate
...

evaluate([test_case], metrics=[conversational_faithfulness])
```

Finally, run `main.py`:

```bash
python main.py
```

✅ Done. There are lots of details we left out from this multi-turn section, such as how to simulate user interactions instead, which you can find more [here.](/docs/getting-started-chatbots)

<VideoDisplayer src="https://deepeval-docs.s3.us-east-1.amazonaws.com/getting-started%3Arag-evals%3Aconversation.mp4" />

</TimelineItem>
</Timeline>

## Next Steps

Now that you have run your first RAG evals, you should:

1. **Customize your metrics**: Include all 5 [RAG metrics](/docs/metrics-introduction) based on your use case.
2. **Prepare a dataset**: If you don't have one, [generate one](/docs/synthesizer-introduction) as a starting point.
3. **Enable evals in production**: Just replace `metrics` in `@observe` with a [`metric_collection`](https://www.confident-ai.com/docs/llm-tracing/evaluations#online-evaluations) string on Confident AI.

You'll be able to analyze performance over time on **threads** this way, and add them back to your evals dataset for further evaluation.

<VideoDisplayer
  src="https://confident-docs.s3.us-east-1.amazonaws.com/llm-tracing:traces.mp4"
  confidentUrl="/docs/llm-tracing/introduction"
  label="RAG Evaluation in Production"
/>
